Probabilistic Model of Nodes

Now that we have a way of splitting up cells, it makes sense to look at what a cell's type tells us about the type of the cell likely to follow it. For example, we want to answer questions such as:

Given that a cell is an 'import', what is the probability that we see an expression next?

Intuitively, we should expect some order here. This could be the basis for further experiments: choosing classes that maximize the order we find in these conditional probabilities.
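
As a concrete illustration of the question above, here is a minimal sketch of how such a conditional probability can be estimated from a sequence of cell labels. The sequence and its labels are made up, and the actual computation in this notebook is done by the CondComputer helper below.

from collections import Counter

# Hypothetical sequence of cell classes from a single notebook.
sequence = ['import', 'import', 'expression', 'assign', 'expression']

# Count bigrams (current, next) and occurrences of each current class.
bigrams = Counter(zip(sequence, sequence[1:]))
currents = Counter(sequence[:-1])

# P(next = 'expression' | current = 'import')
p = bigrams[('import', 'expression')] / currents['import']
print(p)  # 0.5 -- one of the two 'import' cells is followed by an expression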

There are two ways of looking at this problem, which differ in how we choose a class for a node:

  1. We can sort the nodes into classes manually (for example, a simplistic method would be to use the node type as the class).
  2. We can map the nodes to Euclidean space, cluster them into groups, and assign each group to a class (see the sketch after this list).
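
For the second approach, here is a minimal sketch of what that pipeline could look like, using a single made-up scalar feature (the number of AST nodes in a cell) and scikit-learn's KMeans. The experiments below use hand-picked bins rather than k-means, so this is only an illustration of the general idea.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D feature: number of AST nodes in each cell.
sizes = np.array([3, 5, 6, 11, 14, 30, 35, 80]).reshape(-1, 1)

# Cluster the cells into three groups; each cluster id becomes a class.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sizes)
print(kmeans.labels_)  # one cluster id per cell, usable as its class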

We examine both approaches using the following two class definitions:

  1. Use the node type as a class
  2. Use the number of nodes in the node's tree as a feature, and cluster by size

In [1]:
# Necessary imports 
import os
import time
from nbminer.notebook_miner import NotebookMiner
from nbminer.cells.cells import Cell
from nbminer.features.ast_features import ASTFeatures
from nbminer.stats.summary import Summary
from nbminer.stats.multiple_summary import MultipleSummary

In [2]:
# Load every notebook under ../testbed/Final (one subdirectory per person)
people = os.listdir('../testbed/Final')
notebooks = []
for person in people:
    person = os.path.join('../testbed/Final', person)
    if os.path.isdir(person):
        direc = os.listdir(person)
        notebooks.extend([os.path.join(person, filename) for filename in direc if filename.endswith('.ipynb')])
notebook_objs = [NotebookMiner(file) for file in notebooks]
a = ASTFeatures(notebook_objs)

In [3]:
# Replace each notebook with its processed version
for i, nb in enumerate(a.nb_features):
    a.nb_features[i] = nb.get_new_notebook()

In [4]:
from helper_classes.cond_computer import CondComputer
# Build one flat sequence of cell classes across all notebooks, using the
# type of each cell's top-level AST node as the class and 'start'/'end'
# sentinels to mark notebook boundaries.
node_list = []
for i, nb in enumerate(a.nb_features):
    node_list.append('start')
    for cell in nb.get_all_cells():
        t = type(cell.get_feature('ast').body[0])
        node_list.append(t)
    node_list.append('end')
cc = CondComputer(node_list)
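
For intuition about what the CondComputer is given, a row-normalized bigram table over such a sequence can also be built directly with pandas. This sketch is independent of the helper class and uses a short hypothetical sequence in the same format as node_list.

import pandas as pd

# Hypothetical label sequence with the same 'start'/'end' sentinels.
seq = ['start', 'Import', 'Expr', 'Assign', 'Expr', 'end']

# Entry (r, c) of the table estimates P(next = c | current = r).
pairs = pd.DataFrame({'current': seq[:-1], 'next': seq[1:]})
print(pd.crosstab(pairs['current'], pairs['next'], normalize='index'))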

In [5]:
arr, arr_names = cc.compute_probabilities(cc.count_totals, .01)

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (20, 10)

cc.plot_bar(arr, arr_names, 'Probability per Node type')


Plots

The following plots show the conditional probabilities for each of the cell types.


In [7]:
cc.plot_conditional_bar(arr, arr_names, 0, 'Probability per Node type')


<class '_ast.Import'>

In [8]:
cc.plot_conditional_bar(arr, arr_names, 1, 'Probability per Node type')


<class '_ast.ImportFrom'>

In [9]:
cc.plot_conditional_bar(arr, arr_names, 2, 'Probability per Node type')


<class '_ast.Expr'>

In [10]:
cc.plot_conditional_bar(arr, arr_names, 3, 'Probability per Node type')


<class '_ast.Assign'>

In [11]:
cc.plot_conditional_bar(arr, arr_names, 4, 'Probability per Node type')


<class '_ast.For'>

In [12]:
cc.plot_conditional_bar(arr, arr_names, 5, 'Probability per Node type')


<class '_ast.FunctionDef'>


Looking at Node Size Conditional Probabilities

Now that we have a good idea of the predictive power of the node type, and of how assigning each node to a class could work, let's take a look at how clustering on a feature space works. The clustering was done by choosing bins such that each bin holds roughly the same number of examples.
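
Those bin edges were picked by hand, but the same effect can be approximated automatically with quantiles. Here is a minimal sketch; equal_frequency_edges is a hypothetical helper, not part of nbminer.

import numpy as np

def equal_frequency_edges(values, n_bins):
    # Placing bin edges at the interior quantiles of the distribution
    # yields bins containing roughly the same number of examples.
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(values, qs)

# e.g. equal_frequency_edges(ast_sizes, 8) gives edges comparable to the
# hand-picked bin_end list below.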


In [13]:
# Record the AST size (number of nodes) of every cell
ast_sizes = []
for i, nb in enumerate(a.nb_features):
    nb.set_ast_size()
    for el in nb.get_all_cells():
        ast_sizes.append(el.get_feature('ast_size'))

In [14]:
# Hand-picked upper bin edges; bin i holds sizes in [bin_end[i-1], bin_end[i])
bin_end = [4, 7, 9, 12, 15, 22, 36]
bin_count = {}
for el in bin_end:
    bin_count[el] = 0
for num in ast_sizes:
    # Assign each size to the first bin whose upper edge exceeds it;
    # sizes >= bin_end[-1] fall outside every bin and are not counted.
    for i in range(len(bin_end)):
        if num < bin_end[i]:
            bin_count[bin_end[i]] += 1
            break
# Human-readable labels for the seven bins, aligned with the edges above.
names = ['Less than ' + str(bin_end[0])]
for i in range(1, len(bin_end)):
    names.append(str(bin_end[i-1]) + ' <= Num Nodes < ' + str(bin_end[i]))

In [15]:
for key in bin_count.keys():
    print (key, bin_count[key])


4 2041
7 1564
9 2994
12 4255
15 2175
22 3627
36 2277

In [16]:
# Build the flat class sequence again, this time using each cell's AST-size
# bin index as its class (sizes >= bin_end[-1] get no class, as above).
size_features = []
for i, nb in enumerate(a.nb_features):
    nb.set_ast_size()
    size_features.append('start')
    for el in nb.get_all_cells():
        num = el.get_feature('ast_size')
        for ind in range(len(bin_end)):
            if num < bin_end[ind]:
                size_features.append(ind)
                break
    size_features.append('end')

In [17]:
cc = CondComputer(size_features)

In [18]:
arr, arr_names = cc.compute_probabilities(cc.count_totals, 0, np.arange(7))

In [19]:
cc.plot_bar(arr, names, 'Probability per Node size')



In [20]:
cc.plot_conditional_bar(arr, arr_names, 0, 'Probability per Node size', x_labels = names)


0

In [21]:
cc.plot_conditional_bar(arr, arr_names, 1, 'Probability per Node size', x_labels = names)


1

In [22]:
cc.plot_conditional_bar(arr, arr_names, 2, 'Probability per Node size', x_labels = names)


2

In [23]:
cc.plot_conditional_bar(arr, arr_names, 3, 'Probability per Node size', x_labels = names)


3

In [24]:
cc.plot_conditional_bar(arr, arr_names, 4, 'Probability per Node size', x_labels = names)


4

In [25]:
cc.plot_conditional_bar(arr, arr_names, 5, 'Probability per Node size', x_labels = names)


5

In [26]:
cc.plot_conditional_bar(arr, arr_names, 6, 'Probability per Node size', x_labels = names)


6

Conclusion

We find that there is definite predictive power in both methods of defining classes. The results generally correspond with our preconceived notions of how likely a certain cell type is to appear after another. There were some interesting correlations in the cell-size experiments, and we are interested both in looking further into these correlations and in generating new features to test them with.

